Abstract

We first present a general risk bound for ensembles that depends on the $L_p$ norm
of the weighted combination of voters, which can be selected from a continuous
set. We then propose a boosting method, called QuadBoost, which is strongly
supported by the general risk bound and has very simple rules for assigning the
voters' weights. Moreover, QuadBoost exhibits a rate of decrease of its empirical
error which is slightly faster than the one achieved by AdaBoost. The experimental
results confirm the expectation of the theory that QuadBoost is a very efficient
method for learning ensembles.


1 Introduction 

As data is becoming very abundant, machine learning is now confronted with the challenge of having 
to learn complex models from huge data sets. Among the learning algorithms which seem most 
likely to be able to scale up to meet this challenge are ensemble methods based on the idea of 
boosting weak learners [1]. Take AdaBoost [2] for example. If a weak learner is (almost always) 
able to produce in linear time a classifier achieving an empirical error just slightly better than random 
guessing, then the exponential rate of decrease of the training error of AdaBoost will give us a good 
majority vote in linear time. 

After AdaBoost was published, it soon became clear that infinitely many surrogate loss functions [3]
and regularizers could be used for boosting and, unsurprisingly, many variants have been proposed,
to the point where practitioners are often overwhelmed when having to pick a boosting algorithm
for their learning task. Are some algorithms better than others? If so, under what circumstances
are they better? If not, are they, somehow, all equivalent? In an attempt to answer these questions,
we have searched for a risk bound guarantee that applies to all ensemble methods, no matter which
surrogate loss and regularizer the algorithm uses. What emerges from the risk bound presented in
the next section is a distinct difference between an $L_1$ norm regularizer and all the other $L_p$
norm regularizers with $p > 1$. This difference appears to be fundamental in the sense that the
Rademacher complexity of a unit $L_p$ norm combination of functions with $p > 1$ depends explicitly
on the number of functions used in the ensemble, while no such dependence occurs for the $L_1$ norm
case. Consequently, explicit control of the number of voters in the ensemble should be exercised
while boosting with an $L_p$ regularizer with $p > 1$, but no such control is needed while
regularizing with the $L_1$ norm.

Concerning the surrogate loss to be used for boosting, we propose the simple quadratic
loss (hence the name QuadBoost). Although the theory suggests using the hinge loss, this leads to
linear programming algorithms which could become computationally prohibitive with a large number
of voters and huge data sets. The quadratic loss, on the other hand, leads to very simple rules for
setting the weights on the voters, does not need to assign weights to the training examples, and
exhibits a rate of decrease of the training error which is slightly faster than the one achieved by
AdaBoost. The use of the quadratic loss for boosting was already proposed by Bühlmann and Yu
[4], who analysed its effectiveness through the bias-variance decomposition. Here, instead,
we analyse it through a risk bound and propose several regularized variants that were not
considered by Bühlmann and Yu [4]. The experimental results confirm the expectation of the theory
that QuadBoost is a very efficient method for learning ensembles and, consequently, is likely to be
effective for learning complex models from data.

2 A General Risk Bound for Ensembles 

We consider the difficult task of finding classifiers having small expected zero-one loss. In the
supervised learning setting, the learner has access to a training set
$S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of $m$ examples where each example $(x_i, y_i)$ is drawn
independently from a fixed, but unknown, distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. For
the binary classification case, the input space $\mathcal{X}$ is arbitrary, whereas the output space
is $\mathcal{Y} = \{-1, +1\}$. Given access to $S$, the task of the learner is to find, in reasonable
time, a classifier $f : \mathcal{X} \to \mathcal{Y}$ having a small expected zero-one loss
$\mathbf{E}_{(x,y)\sim D}\, I(f(x) \neq y)$, where $I(a) = 1$ if predicate $a$ is true, and $0$
otherwise.

We are not only concerned here with the problem of finding classifiers with good generalization (i.e., 
small expected zero-one loss), but also with the running time complexity of finding such classifiers. 
In that respect, ensemble methods, such as AdaBoost [2], appear to us as the most promising. Let us
then investigate these methods with respect to both objectives.

As is often the case with ensemble methods, we assume that we have access to a (possibly
continuous) set $\mathcal{H}$ of real-valued functions that we call the set of possible voters. Our
task is to select from $\mathcal{H}$ a finite subset of $n$ voters from which a weighted majority
vote classifier is produced. Let $\mathbf{h} = (h_1, \ldots, h_n)$ denote the vector formed by
concatenating these $n$ voters and let $\alpha = (\alpha_1, \ldots, \alpha_n)$ denote the vector of
$n$ real-valued weights used to weight the voters. For any input $x \in \mathcal{X}$, the output
$f_{\alpha,\mathbf{h}}(x)$ on $x$ of the weighted majority vote is given by

$$f_{\alpha,\mathbf{h}}(x) = \mathrm{sgn}(\alpha \cdot \mathbf{h}(x)) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \alpha_i h_i(x)\Big),$$

where $\mathrm{sgn}(z) = +1$ if $z > 0$, and $-1$ otherwise.

To find out what the majority vote $f_{\alpha,\mathbf{h}}$ should optimize on the training data $S$
to have good generalization, we have investigated guarantees known as uniform risk bounds. Those
based on the Rademacher complexity are particularly appealing and tight. In a nutshell,
the Rademacher complexity of a class of functions measures its capacity to fit random noise. More
precisely, given a set $S$ of $m$ examples, the empirical Rademacher complexity
$\mathcal{R}_S(\mathcal{F})$ of a class $\mathcal{F}$ of real-valued functions and its expectation
$\mathcal{R}_m(\mathcal{F})$ are defined as

$$\mathcal{R}_S(\mathcal{F}) = \mathop{\mathbf{E}}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \,; \qquad \mathcal{R}_m(\mathcal{F}) = \mathop{\mathbf{E}}_{S \sim D^m} \mathcal{R}_S(\mathcal{F}),$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)$ and where each $\sigma_i$ is a $\pm 1$-valued random
variable drawn independently according to the uniform distribution.
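To make the definition concrete, the following Python sketch computes the empirical Rademacher complexity exactly for a small finite class by enumerating all sign vectors. The function name `empirical_rademacher` and the toy class are illustrative assumptions and do not come from the paper; the enumeration is only practical for very small $m$.

```python
from itertools import product


def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite function class.

    `outputs` is a list of rows; each row holds (f(x_1), ..., f(x_m)) for one
    function f of the class. The expectation over sigma is computed by
    enumerating all 2^m sign vectors.
    """
    m = len(outputs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # sup over the class of (1/m) * sum_i sigma_i * f(x_i)
        total += max(sum(s * v for s, v in zip(sigma, row)) / m for row in outputs)
    return total / 2 ** m


# Hypothetical toy class: two classifiers and their negations (a symmetric class).
base = [[1, 1, -1], [1, -1, -1]]
toy_class = base + [[-v for v in row] for row in base]
print(empirical_rademacher(toy_class))
```

A class containing every sign pattern on the sample can fit any noise vector, so its complexity is 1, whereas a single fixed function has complexity 0.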

Given a weighted majority vote $f_{\alpha,\mathbf{h}}$ of functions taken from a function class
$\mathcal{H}$, let $\|\alpha\|_p$ denote the $L_p$ norm of vector $\alpha$ for any $p \ge 1$ and let
$\dim(\alpha)$ denote the dimension (i.e., the number of components) of vector $\alpha$. An
important issue concerning majority votes is the complexity of the set of functions induced by
taking weighted combinations of functions at fixed norm. Hence, given a class $\mathcal{H}$ of
real-valued functions, let us consider

$$\mathcal{C}_p^n(\mathcal{H}) = \{\, x \mapsto \alpha \cdot \mathbf{h}(x) \mid h_i \in \mathcal{H}\ \forall i,\ \dim(\alpha) = n,\ \|\alpha\|_p = 1 \,\}.$$

We already know that $\mathcal{R}_S(\mathcal{C}_1^n(\mathcal{H})) = \mathcal{R}_S(\mathcal{H})$ for
any $n$. But what happens if the weighted combination is at unit $L_p$ norm for $p > 1$? The next
lemma, which is apparently new, tells us that taking a weighted combination of functions strictly
increases the Rademacher complexity for $p > 1$.
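We will assume, for the rest of the paper, that the class $\mathcal{H}$ of voters is symmetric. This
means that if voter $h$ is in $\mathcal{H}$, then voter $-h$ is also in $\mathcal{H}$.

Lemma 1. For any symmetric class $\mathcal{H}$ of real-valued functions, any $n \in \mathbb{N}$, and
any $p \in [1, +\infty)$, we have

$$\mathcal{R}_S(\mathcal{C}_p^n(\mathcal{H})) = n^{1-\frac{1}{p}}\, \mathcal{R}_S(\mathcal{H}).$$

Proof. Let $(1/q) = 1 - (1/p)$. We will make use of the known fact that Hölder's inequality is
attained at the supremum, namely for all $p > 1$ and any vector $v$, we have

$$\sup_{\alpha : \|\alpha\|_p = 1} \sum_{i=1}^{n} \alpha_i v_i = \Big(\sum_{i=1}^{n} |v_i|^q\Big)^{1/q}.$$

Consequently, we have

$$\begin{aligned}
\mathcal{R}_S(\mathcal{C}_p^n)
&= \mathop{\mathbf{E}}_{\sigma} \sup_{\mathbf{h} \in \mathcal{H}^n}\ \sup_{\alpha : \|\alpha\|_p = 1} \frac{1}{m} \sum_{k=1}^{m} \sigma_k \sum_{i=1}^{n} \alpha_i h_i(x_k) \\
&= \mathop{\mathbf{E}}_{\sigma} \sup_{\mathbf{h} \in \mathcal{H}^n}\ \sup_{\alpha : \|\alpha\|_p = 1} \sum_{i=1}^{n} \alpha_i \Big(\frac{1}{m} \sum_{k=1}^{m} \sigma_k h_i(x_k)\Big) \\
&= \mathop{\mathbf{E}}_{\sigma} \sup_{\mathbf{h} \in \mathcal{H}^n} \Big(\sum_{i=1}^{n} \Big|\frac{1}{m} \sum_{k=1}^{m} \sigma_k h_i(x_k)\Big|^q\Big)^{1/q} \\
&= \mathop{\mathbf{E}}_{\sigma} \Big(\sum_{i=1}^{n} \sup_{h_i \in \mathcal{H}} \Big|\frac{1}{m} \sum_{k=1}^{m} \sigma_k h_i(x_k)\Big|^q\Big)^{1/q} \\
&= \mathop{\mathbf{E}}_{\sigma} \Big(n \sup_{h \in \mathcal{H}} \Big|\frac{1}{m} \sum_{k=1}^{m} \sigma_k h(x_k)\Big|^q\Big)^{1/q} \\
&= \mathop{\mathbf{E}}_{\sigma} \Big(n \Big(\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{k=1}^{m} \sigma_k h(x_k)\Big)^q\Big)^{1/q} \qquad \text{(since $\mathcal{H}$ is symmetric)} \\
&= n^{1/q}\, \mathcal{R}_S(\mathcal{H}),
\end{aligned}$$

which proves the lemma. □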
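The key step of the proof, the attainment of Hölder's inequality, can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `p_norm` and `holder_maximizer`, the choice $p = 3$, and the random vector are not from the paper. It builds the unit-$L_p$ vector $\alpha_i = \mathrm{sgn}(v_i)|v_i|^{q-1} / \|v\|_q^{q-1}$ and compares $\alpha \cdot v$ with $\|v\|_q$.

```python
import random


def p_norm(v, p):
    """L_p norm of a vector, for finite p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def holder_maximizer(v, p):
    """Unit-L_p vector alpha maximizing alpha . v, for 1 < p < infinity and v != 0."""
    q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
    scale = p_norm(v, q) ** (q - 1.0)
    return [(1.0 if x >= 0 else -1.0) * abs(x) ** (q - 1.0) / scale for x in v]


random.seed(0)
p = 3.0
q = p / (p - 1.0)
v = [random.uniform(-1.0, 1.0) for _ in range(5)]
alpha = holder_maximizer(v, p)
dot = sum(a * x for a, x in zip(alpha, v))
print(p_norm(alpha, p), dot, p_norm(v, q))
```

When all $n$ coordinates of $v$ share the same magnitude $s$, the supremum equals $n^{1/q} s$, which is where the factor $n^{1-1/p}$ of Lemma 1 comes from.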


The next theorem, which is built on Lemma 1, constitutes the main theoretical result of the paper.
It provides a uniform upper bound on the expected zero-one loss of weighted majority votes
$f_{\alpha,\mathbf{h}}$ in terms of their empirical risk (i.e., the expected loss estimated on the
training data) measured with respect to any loss function $\mathcal{L}$ which upper-bounds the
zero-one loss. The upper bound also depends on the Lipschitz property of the clipped version of
$\mathcal{L}$. To define these notions precisely, let $\mathcal{L}(y\,\alpha \cdot \mathbf{h}(x))$
denote the loss incurred by $f_{\alpha,\mathbf{h}}$, as measured by $\mathcal{L}$, on example
$(x, y)$. Then the loss incurred by $f_{\alpha,\mathbf{h}}$, as measured by the clipped version of
$\mathcal{L}$, is defined to be $\lfloor \mathcal{L}(y\,\alpha \cdot \mathbf{h}(x)) \rfloor_1$ where
$\lfloor x \rfloor_1 = \min(x, 1)$. Finally, a function $A : \mathbb{R} \to \mathbb{R}$ is said to be
$\ell$-Lipschitz for some $\ell > 0$ if and only if $|A(x) - A(x')| \le \ell\,|x - x'|$ for any $x$
and $x'$.

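Theorem 1. Consider any distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. Consider any loss
function $\mathcal{L}$ which upper-bounds the zero-one loss and for which its clipped version is
$\ell$-Lipschitz. Let $\mathcal{H}$ be any symmetric class of real-valued functions on the input
space $\mathcal{X}$. For all $p \in [1, +\infty)$, for all $\delta \in (0, 1]$, with probability at
least $1 - \delta$ over the random draws of $S \sim D^m$, we have simultaneously for all $\alpha$
on $\mathcal{H}$,

$$\begin{aligned}
\mathop{\mathbf{E}}_{(x,y) \sim D} I\big(y\,\alpha \cdot \mathbf{h}(x) \le 0\big)
\le{}& \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(y_i\,\alpha \cdot \mathbf{h}(x_i)\big)
+ 4\ell\, \dim(\alpha)^{1-\frac{1}{p}}\, \|\alpha\|_p\, \mathcal{R}_m(\mathcal{H}) \\
&+ \sqrt{\frac{1}{2m} \log \frac{\pi^2 (\dim(\alpha) + 1)^2}{6\delta}}
+ \frac{1}{m} \log\log_2 \big[2\|\alpha\|_p\big]. \qquad (1)
\end{aligned}$$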
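As is usual with Rademacher complexities, Theorem 1 also applies with $\mathcal{R}_m(\mathcal{H})$
replaced by $\mathcal{R}_S(\mathcal{H})$ if $\delta$ is replaced by $\delta/2$ and if the last term
is multiplied by 3.

Proof. The fundamental theorem on Rademacher complexities (see, for example, Mohri et al. [5],
Shawe-Taylor and Cristianini [6]) states that for any class $\mathcal{G}$ mapping some domain
$\mathcal{Z}$ to $[0, 1]$, for any distribution $D$ on $\mathcal{Z}$, for any $\delta > 0$, with
probability at least $1 - \delta$, we have simultaneously for all $g \in \mathcal{G}$

$$\mathop{\mathbf{E}}_{z \sim D} g(z) \le \frac{1}{m} \sum_{i=1}^{m} g(z_i) + 2\mathcal{R}_m(\mathcal{G}) + \sqrt{\frac{1}{2m} \log \frac{1}{\delta}}.$$

Given any $\mathcal{L}$, any $p \in [1, +\infty]$, and any $\gamma > 0$, we can apply this theorem to
the set $\mathcal{G}_\gamma$ of functions that map each example $(x, y)$ to
$\lfloor \mathcal{L}(y\,\frac{\mathbf{q}}{\gamma} \cdot \mathbf{h}(x)) \rfloor_1$ for
$\|\mathbf{q}\|_p = 1$ when each $h_i \in \mathcal{H}$. By hypothesis, the clipped version of
$\mathcal{L}$ is $\ell$-Lipschitz. Hence, by Talagrand's lemma (see, for example, Theorem 4.2 of
Mohri et al. [5]), we have that
$\mathcal{R}_S(\mathcal{G}_\gamma) \le \ell\, \mathcal{R}_S(\mathcal{G}'_\gamma)$, where
$\mathcal{G}'_\gamma$ is the set of functions mapping $(x, y)$ to
$y\,\frac{\mathbf{q}}{\gamma} \cdot \mathbf{h}(x)$.

Also, since $\gamma$ is a constant and $y = \pm 1$, we have that
$\mathcal{R}_S(\mathcal{G}'_\gamma) = (1/\gamma)\, \mathcal{R}_S(\mathcal{C}_p^n)$ where
$\mathcal{C}_p^n$ is the set of functions mapping $x$ to $\mathbf{q} \cdot \mathbf{h}(x)$ such that
$\|\mathbf{q}\|_p = 1$ and $\dim(\mathbf{q}) = n$. Thus,
$\mathcal{R}_S(\mathcal{G}_\gamma) \le (\ell/\gamma)\, \mathcal{R}_S(\mathcal{C}_p^n) = (\ell/\gamma) \dim(\mathbf{q})^{1-(1/p)}\, \mathcal{R}_S(\mathcal{H})$
according to Lemma 1.

Then, for any $\gamma > 0$, with probability at least $1 - \delta$ we have

$$\mathop{\mathbf{E}}_{(x,y) \sim D} \Big\lfloor \mathcal{L}\Big(y\,\frac{\mathbf{q}}{\gamma} \cdot \mathbf{h}(x)\Big) \Big\rfloor_1 \le \frac{1}{m} \sum_{i=1}^{m} \Big\lfloor \mathcal{L}\Big(y_i\,\frac{\mathbf{q}}{\gamma} \cdot \mathbf{h}(x_i)\Big) \Big\rfloor_1 + 2(\ell/\gamma) \dim(\mathbf{q})^{1-(1/p)}\, \mathcal{R}_m(\mathcal{H}) + \sqrt{\frac{1}{2m} \log \frac{1}{\delta}}.$$

By using the union bound technique of Theorem 4.5 of Mohri et al. [5], we can make the above
bound valid uniformly over all values for $\gamma$ by adding $(1/m) \log\log_2(2/\gamma)$ to its
right-hand side and by multiplying the $\mathcal{R}_m(\mathcal{H})$ term by 2. To obtain a bound
which is also valid uniformly for all values of $n = \dim(\mathbf{q})$, we replace $\delta$ by
$(6\delta/\pi^2)(n + 1)^{-2}$. Then, by using $\alpha = \mathbf{q}/\gamma$, we have that
$\|\alpha\|_p = 1/\gamma$. The theorem then follows from the fact that
$I(y\,\alpha \cdot \mathbf{h}(x) \le 0) \le \lfloor \mathcal{L}(y\,\alpha \cdot \mathbf{h}(x)) \rfloor_1 \le \mathcal{L}(y\,\alpha \cdot \mathbf{h}(x))$
for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. □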

If we ignore the slowly increasing logarithm terms in Equation (1), Theorem 1 tells us that to obtain
a majority vote with a small zero-one loss, it is sufficient to minimize the empirical risk, as
measured with respect to a surrogate loss $\mathcal{L}$, plus a regularization term equal to
$4\ell\, \dim(\alpha)^{1-\frac{1}{p}}\, \|\alpha\|_p\, \mathcal{R}_m(\mathcal{H})$.
Note that when $p = 1$, this regularization term is equal to
$4\ell\, \|\alpha\|_1\, \mathcal{R}_m(\mathcal{H})$ and, therefore, does
not depend on the number $\dim(\alpha)$ of voters used by the majority vote. Hence, when performing
$L_1$ regularization with a fixed set $\mathcal{H}$ of voters and surrogate loss $\mathcal{L}$, the
only thing that matters is to control the $L_1$ norm of $\alpha$ while minimizing the empirical
risk. This is in sharp contrast with the $p > 1$ cases where we need to control both the $L_p$ norm
of $\alpha$ and the number $\dim(\alpha)$ of voters used by $f_{\alpha,\mathbf{h}}$. Therefore,
iterative learning algorithms that minimize the empirical loss under $L_p$ regularization should
also perform early stopping or exercise some other explicit control on the number of voters used by
the majority vote when $p > 1$. This explicit control on $\dim(\alpha)$ is most important with
$L_\infty$ regularization because the regularization term then grows linearly with $\dim(\alpha)$.
But it is also important with $L_2$ regularization since the regularization term then grows
with $\sqrt{\dim(\alpha)}$. Finally, and perhaps most importantly, just minimizing iteratively the
empirical risk and using early stopping to control overfitting is a simple learning strategy that is
supported by Theorem 1. Indeed, as long as the iterative procedure does not choose large weights for
the voters, early stopping keeps $\|\alpha\|_1$ under control and the right-hand side (r.h.s.) of
Equation (1) should be small when $\|\alpha\|_1 \ll \sqrt{m}$ (since
$\mathcal{R}_m(\mathcal{H}) \in O(1/\sqrt{m})$ when $\mathcal{H}$ has finite VC dimension) and the
empirical risk of the ensemble has reached a small value.
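As a small illustration of this dependence on $\dim(\alpha)$, the following Python sketch evaluates the regularization term of Equation (1) for unit-norm weight vectors of increasing dimension. The function name `regularizer` and the numeric values chosen for $\ell$ and $\mathcal{R}_m(\mathcal{H})$ are assumptions made only for this example.

```python
def regularizer(alpha, p, lipschitz, rademacher):
    """Regularization term 4 * l * dim(alpha)^(1 - 1/p) * ||alpha||_p * R_m(H)."""
    n = len(alpha)
    if p == float("inf"):
        norm = max(abs(a) for a in alpha)
        growth = float(n)
    else:
        norm = sum(abs(a) ** p for a in alpha) ** (1.0 / p)
        growth = n ** (1.0 - 1.0 / p)
    return 4.0 * lipschitz * growth * norm * rademacher


# Illustrative values (assumptions): l = 2 and R_m(H) = 0.1.
for n in (1, 4, 16):
    unit_l1 = [1.0 / n] * n            # ||alpha||_1 = 1
    unit_l2 = [n ** -0.5] * n          # ||alpha||_2 = 1
    unit_linf = [1.0] * n              # ||alpha||_inf = 1
    print(n,
          regularizer(unit_l1, 1, 2.0, 0.1),
          regularizer(unit_l2, 2, 2.0, 0.1),
          regularizer(unit_linf, float("inf"), 2.0, 0.1))
```

At unit norm the term stays constant for $p = 1$, grows like $\sqrt{n}$ for $p = 2$, and grows linearly in $n$ for $p = \infty$.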

The other important issue regarding Theorem 1 concerns the choice of the surrogate loss
$\mathcal{L}$. Obviously, the closer (or tighter) $\mathcal{L}$ is to the zero-one loss, the better.
However, to avoid computational problems associated with the existence of several local minima of
the empirical risk, let us settle for a surrogate $\mathcal{L}$ that is convex in $\alpha$. One of
the tightest convex surrogates that we can think of is the hinge loss. However, the presence of a
discontinuity in its first derivative does not give rise to a simple boosting-type iterative
algorithm, sometimes called forward stagewise additive modelling [7], that attempts to produce a
majority vote by iteratively adding voters chosen from a possibly continuous set $\mathcal{H}$. The
hinge loss does give rise to a linear programming algorithm, called LPBoost [8], when used in
conjunction with $L_1$ regularization. Although this learning strategy is strongly supported by
Theorem 1, iteratively solving a linear program each time a new voter is inserted into the ensemble
is a computationally expensive strategy when the number of voters in the ensemble is large. As
machine learning is entering the Big Data era, we want to restrict ourselves to forward stagewise
additive modelling algorithms. One solution is to use a smoother surrogate, containing no
discontinuities in its derivatives, at the price of sacrificing a bit of the tightness with respect
to the zero-one loss. The exponential loss minimized by AdaBoost [2] has continuous first
derivatives but is very far from being a tight upper bound of the zero-one loss. The logistic loss
minimized by LogitBoost [9] is much better in that respect but does not produce simple update rules
in the sense that each example of the training set needs to be reweighted each time a new voter is
added to the ensemble. Are there simpler update rules that perform as well as AdaBoost and
LogitBoost? Let us try to answer this question by analyzing what happens if we use the very simple
quadratic loss for the surrogate $\mathcal{L}$.

The quadratic loss is commonly used in classical learning methods such as (kernel) ridge regression
and back-propagation neural network learning and has been considered by Hastie et al. [7] for
forward stagewise additive modelling algorithms. But these authors do not recommend the use of
the quadratic loss as a surrogate for the zero-one loss and propose, instead, the use of losses that
are more robust against outliers, such as the squared hinge loss. However, such "Huberized" losses
do not give rise to simple boosting algorithms such as those obtained with the square loss below.
The square loss has been considered for boosting by Bühlmann and Yu [4] (under the name "Boosting
with the $L_2$ loss"), who obtained excellent performance when using cubic splines. However, they
did not consider any $L_p$-regularized variant of boosting. More recently, Germain et al. [10]
considered boosting with the quadratic loss with a Kullback-Leibler regularizer. Consequently, their
boosting algorithm turned out to be different from the algorithms proposed in the next section.
Moreover, their PAC-Bayesian theory based on quasi-uniform posteriors was developed only for the
case of a finite set of voters and does not extend to the continuous case. Hence, it was not
realized that it was necessary to control the number of voters when boosting with an $L_p$
regularizer with $p > 1$, while no such control is needed for $p = 1$.


3 QuadBoost 


We now investigate whether the quadratic loss can yield simple and efficient iterative algorithms
for producing ensembles of voters. For this task, consider any $n \in \mathbb{N}$, any vector
$\mathbf{h} = (h_1, \ldots, h_n)$ of voters where each $h_i \in \mathcal{H}$, and any vector
$\alpha = (\alpha_1, \ldots, \alpha_n)$ of real-valued weights on $\mathbf{h}$. Let us start by
writing the quadratic risk (on $m$ examples) as

$$\begin{aligned}
\frac{1}{m} \sum_{k=1}^{m} \big(y_k - \alpha \cdot \mathbf{h}(x_k)\big)^2
={}& 1 - 2 \sum_{j=1}^{n} \alpha_j \frac{1}{m} \sum_{k=1}^{m} y_k h_j(x_k)
+ \sum_{j=1}^{n} \alpha_j^2 \frac{1}{m} \sum_{k=1}^{m} h_j^2(x_k) \\
&+ 2 \sum_{j=1}^{n} \alpha_j \frac{1}{m} \sum_{k=1}^{m} h_j(x_k) \sum_{i=1}^{j-1} \alpha_i h_i(x_k). \qquad (2)
\end{aligned}$$

If, for each voter $h_j$ of the ensemble, we now define its margin $\mu_j$ as

$$\mu_j = \frac{1}{m} \sum_{k=1}^{m} y_k h_j(x_k), \qquad (3)$$

and its correlation $M_j$ with the weighted sum of the previous voters as

$$M_j = \begin{cases} \dfrac{1}{m} \displaystyle\sum_{k=1}^{m} h_j(x_k) \sum_{i=1}^{j-1} \alpha_i h_i(x_k) & \text{if } j > 1 \\ 0 & \text{if } j = 1, \end{cases} \qquad (4)$$

we obtain the following decomposition of the quadratic risk

$$\frac{1}{m} \sum_{k=1}^{m} \big(y_k - \alpha \cdot \mathbf{h}(x_k)\big)^2 = 1 - 2 \sum_{j=1}^{n} \alpha_j (\mu_j - M_j) + \sum_{j=1}^{n} \alpha_j^2 \eta_j, \quad \text{where } \eta_j = \frac{1}{m} \sum_{k=1}^{m} h_j^2(x_k). \qquad (5)$$

This decomposition tells us that to minimize the quadratic risk iteratively, we should, at each step
$j$, find a voter $h_j \in \mathcal{H}$ that maximizes $|\mu_j - M_j|$. Once a voter $h_j$ is chosen,
its weight $\alpha_j$ that minimizes the quadratic risk is obtained by setting to 0 its partial
derivative with respect to $\alpha_j$. This gives

$$\alpha_j = \frac{1}{\eta_j} (\mu_j - M_j) \quad \text{(without regularization)}. \qquad (6)$$

Note that the sign of $\alpha_j$ is given by the sign of $(\mu_j - M_j)$. Also, it is easy to verify
that adding to the ensemble a voter $h_j$ with weight $\alpha_j$ given by Equation (6) decreases the
empirical quadratic risk of the ensemble by $(\mu_j - M_j)^2/\eta_j$. When it is computationally
very expensive to find $h_j \in \mathcal{H}$ maximizing $|\mu_j - M_j|$, we can settle for finding
more rapidly any $h_j$ having $|\mu_j - M_j| > 0$ since, in that case, we still make progress by
lowering the empirical quadratic risk of the ensemble by $(\mu_j - M_j)^2/\eta_j$.

Vanilla QuadBoost: Inserting into the ensemble, at each step $j$, a voter $h_j \in \mathcal{H}$
with the weight $\alpha_j$ given by Equation (6) defines what we call the vanilla version of
QuadBoost. Of course, in that case, Theorem 1 tells us that we should eventually stop this greedy
process early to avoid overfitting, which is also the case for AdaBoost.
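The greedy procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a finite pool of voters given by their outputs on the training set, and the names `quad_risk`, `vanilla_quadboost`, and the toy data are hypothetical. Because the pool is finite, the same voter may be selected again in a later round, which amounts to adjusting its weight.

```python
def quad_risk(y, F):
    """Empirical quadratic risk (1/m) * sum_k (y_k - F_k)^2."""
    return sum((yk - fk) ** 2 for yk, fk in zip(y, F)) / len(y)


def vanilla_quadboost(voters, y, rounds):
    """Greedy vanilla QuadBoost over a finite pool of voters.

    voters[i][k] is the output of voter i on training example k.
    Returns the list of (voter index, weight) pairs and the ensemble outputs.
    """
    m = len(y)
    F = [0.0] * m                       # current weighted vote on each example
    ensemble = []
    for _ in range(rounds):
        best = None
        for i, h in enumerate(voters):
            mu = sum(yk * hk for yk, hk in zip(y, h)) / m     # margin, Eq. (3)
            M = sum(hk * fk for hk, fk in zip(h, F)) / m      # correlation, Eq. (4)
            eta = sum(hk * hk for hk in h) / m
            if eta > 0 and (best is None or abs(mu - M) > abs(best[1])):
                best = (i, mu - M, eta)
        if best is None or best[1] == 0.0:
            break                       # no voter can lower the risk any further
        i, gap, eta = best
        a = gap / eta                   # weight, Eq. (6)
        ensemble.append((i, a))
        F = [fk + a * hk for fk, hk in zip(F, voters[i])]
    return ensemble, F


# Hypothetical toy problem: three +/-1 classifiers and their complements.
y = [1, 1, 1, -1, -1, -1]
base = [
    [1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1],
    [-1, 1, 1, 1, -1, -1],
]
voters = base + [[-v for v in h] for h in base]
ensemble, F = vanilla_quadboost(voters, y, rounds=5)
print(ensemble, quad_risk(y, F))
```

No weights on the training examples are maintained; only the current ensemble outputs `F` are updated, which is what keeps each round simple.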

One reason often invoked for using AdaBoost is its exponentially fast decrease of the empirical error
as a function of the number of iterations (boosting rounds). More precisely, assuming that, at each
iteration, the weak learner can always produce a classifier achieving a training error (on the
weighted examples) of at most $(1/2) - \gamma$, the (zero-one loss) training error produced by the
AdaBoost ensemble is at most $\exp(-2\gamma^2 T)$ after $T$ iterations [2]. Consequently, the number
of iterations needed for AdaBoost to obtain an ensemble achieving less than $\epsilon$ empirical
error is $[1/(2\gamma^2)] \log(1/\epsilon)$.

In comparison, under the equivalent assumption that the weak learner is always able to find a voter
$h_j \in \mathcal{H}$ where $|\mu_j - M_j| \ge \gamma$, the decrease in the quadratic empirical risk
(which upper-bounds the zero-one training error) achieved by the QuadBoost ensemble is
$(\mu_j - M_j)^2 \ge \gamma^2$ (since $\eta_j = 1$ for classifiers) at each iteration. Hence, under
this hypothesis, the training error produced by the QuadBoost ensemble after $T$ iterations is at
most $1 - T\gamma^2$, so QuadBoost needs at most $1/\gamma^2$ iterations to have an ensemble
achieving at most $\epsilon$ training error. Consequently, in comparison with AdaBoost, and under an
equivalent hypothesis, the convergence rate of QuadBoost is slightly better.
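The two iteration counts can be compared numerically. The sketch below simply evaluates the two bounds quoted above; the function names and the chosen values of $\gamma$ and $\epsilon$ are assumptions for illustration, and the comparison depends on how small $\epsilon$ is.

```python
import math


def adaboost_rounds(gamma, eps):
    """Smallest T for which the AdaBoost bound exp(-2 * gamma^2 * T) is at most eps."""
    return math.ceil(math.log(1.0 / eps) / (2.0 * gamma ** 2))


def quadboost_rounds(gamma, eps):
    """Smallest T for which the QuadBoost bound 1 - T * gamma^2 is at most eps."""
    return math.ceil((1.0 - eps) / gamma ** 2)


for gamma, eps in [(0.25, 0.05), (0.125, 0.01)]:
    print(gamma, eps, adaboost_rounds(gamma, eps), quadboost_rounds(gamma, eps))
```

For $\gamma = 0.25$ and $\epsilon = 0.05$, the AdaBoost bound requires 24 rounds while the QuadBoost bound requires 16.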

Let us now investigate the different $L_p$ regularized versions of QuadBoost for $p = 1$, $2$, and
$+\infty$, in accordance with the insights given by Theorem 1. To this end, we first note that the
clipped quadratic loss is 2-Lipschitz, so $\ell = 2$ in Theorem 1 when $\mathcal{L}$ is the
quadratic loss.

QuadBoost-$L_1$: For $p = 1$, we should minimize the empirical quadratic risk plus
$2\lambda\|\alpha\|_1$, where, according to Theorem 1, $\lambda$ should be equal to
$4\mathcal{R}_m(\mathcal{H})$. But, in practice, a smaller value for $\lambda$ should provide better
results as there is always some looseness in risk bounds. If we add this regularization term to the
expression of the empirical risk given by Equation (5) and then set to zero the first derivative
w.r.t. $\alpha_j$ of this objective, we find that, at each step $j$, the solution for $\alpha_j$ is
given by

$$\alpha_j = \begin{cases}
\dfrac{1}{\eta_j} (\mu_j - M_j - \lambda) & \text{if } \mu_j - M_j > \lambda \\
\dfrac{1}{\eta_j} (\mu_j - M_j + \lambda) & \text{if } \mu_j - M_j < -\lambda \\
0 & \text{if } |\mu_j - M_j| \le \lambda
\end{cases} \quad \text{($L_1$ regularization)}. \qquad (7)$$

It can be verified that this update rule gives a decrease of $(|\mu_j - M_j| - \lambda)^2/\eta_j$ in
the risk bound value of Theorem 1 when $|\mu_j - M_j| > \lambda$. Here, no explicit early stopping is
needed as, for some chosen $\lambda > 0$, we will eventually be unable to find a voter $h_j$ having
$|\mu_j - M_j| > \lambda$. Hence, the number of voters contained in the ensemble is controlled by
parameter $\lambda$; the larger $\lambda$ is, the smaller the ensemble will be. Finally, note that
this algorithm can be viewed as an iterative version of the LASSO method [11, 12] but where the
functions are selected from a possibly continuous set $\mathcal{H}$.
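The update rule of Equation (7) is a soft-thresholding operation, which can be sketched as follows. The function names `l1_weight` and `l1_objective` are assumptions for this illustration; `l1_objective` only keeps the part of the regularized risk that depends on the weight being set.

```python
def l1_weight(mu, M, eta, lam):
    """Soft-thresholded weight of Eq. (7) for margin mu, correlation M, and eta."""
    gap = mu - M
    if gap > lam:
        return (gap - lam) / eta
    if gap < -lam:
        return (gap + lam) / eta
    return 0.0


def l1_objective(a, mu, M, eta, lam):
    """Terms of the L1-regularized quadratic risk that depend on the weight a."""
    return -2.0 * a * (mu - M) + a * a * eta + 2.0 * lam * abs(a)


a_star = l1_weight(0.6, 0.1, 1.0, 0.2)
print(a_star, l1_objective(a_star, 0.6, 0.1, 1.0, 0.2))
```

With $\mu_j - M_j = 0.5$, $\eta_j = 1$, and $\lambda = 0.2$, the weight is $0.3$ and the objective decreases by $(0.5 - 0.2)^2 = 0.09$, in agreement with the decrease stated above.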

QuadBoost-$L_2$: For $p = 2$, we should minimize the empirical quadratic risk plus
$\lambda\|\alpha\|_2^2$, where, according to Theorem 1, $\lambda$ should be equal to
$8\sqrt{\dim(\alpha)}\, \mathcal{R}_m(\mathcal{H})$. If we add this regularization term
to the expression of the empirical risk given by Equation (5) and then set to zero the first
derivative w.r.t. $\alpha_j$ of this objective, we find that, at each step $j$, the solution for
$\alpha_j$ is given by

$$\alpha_j = \frac{1}{\eta_j + \lambda} (\mu_j - M_j) \quad \text{($L_2$ regularization)}. \qquad (8)$$

It can be verified that this update rule gives a decrease of $(\mu_j - M_j)^2/(\eta_j + \lambda)$ in
the risk bound value. Since, according to theory, $\lambda$ should increase with the number
$\dim(\alpha)$ of voters in the ensemble, explicit early stopping should be performed in addition to
the above rule for $\alpha_j$. Finally, note that this algorithm can be viewed as an iterative
version of ridge regression but where the functions are selected from a possibly continuous set
$\mathcal{H}$.
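A minimal sketch of the ridge-like update of Equation (8) follows. It assumes that the penalty entering the objective is $\lambda$ times the squared weight, which is the form that yields the stated update; the function names are hypothetical.

```python
def l2_weight(mu, M, eta, lam):
    """Shrunken weight of Eq. (8)."""
    return (mu - M) / (eta + lam)


def l2_objective(a, mu, M, eta, lam):
    """Terms of the regularized quadratic risk that depend on the weight a,
    assuming a squared penalty lam * a^2 (an assumption of this sketch)."""
    return -2.0 * a * (mu - M) + a * a * (eta + lam)


for lam in (0.0, 0.25, 1.0):
    a = l2_weight(0.6, 0.1, 1.0, lam)
    print(lam, a, l2_objective(a, 0.6, 0.1, 1.0, lam))
```

Unlike the $L_1$ rule, the weight is shrunk toward zero but never set exactly to zero, so $\lambda$ alone does not limit the number of voters.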


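QuadBoost-$L_\infty$: For $p = +\infty$, we should minimize the empirical quadratic risk plus
$\lambda\|\alpha\|_\infty$, where, according to Theorem 1, $\lambda$ should be equal to
$8 \dim(\alpha)\, \mathcal{R}_m(\mathcal{H})$. Since $\|\alpha\|_\infty = \alpha_{\max}$,
which is the allowed upper bound for the weight values of the voters, each $\alpha_j$ should
minimize the empirical risk provided that its absolute value does not exceed $\alpha_{\max}$. The
solution for $\alpha_j$ is then given by

$$\alpha_j = \begin{cases}
\dfrac{1}{\eta_j} (\mu_j - M_j) & \text{if } \dfrac{1}{\eta_j} |\mu_j - M_j| \le \alpha_{\max} \\
\alpha_{\max}\, \mathrm{sgn}(\mu_j - M_j) & \text{if } \dfrac{1}{\eta_j} |\mu_j - M_j| > \alpha_{\max}
\end{cases} \quad \text{($L_\infty$ regularization)}. \qquad (9)$$

Since, according to theory, $\lambda$ should increase linearly with the number $\dim(\alpha)$ of
voters in the ensemble, explicit early stopping should be performed in addition to the above rule
for $\alpha_j$.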
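The rule of Equation (9) amounts to clipping the unregularized weight to the interval $[-\alpha_{\max}, \alpha_{\max}]$, as the following short sketch shows. The function name `linf_weight` is an assumption for this illustration.

```python
def linf_weight(mu, M, eta, alpha_max):
    """Weight of Eq. (9): the unregularized weight clipped to [-alpha_max, alpha_max]."""
    a = (mu - M) / eta
    return max(-alpha_max, min(alpha_max, a))


print(linf_weight(0.6, 0.1, 1.0, 1.0))   # unregularized weight is within the bound
print(linf_weight(0.6, 0.1, 1.0, 0.3))   # weight is clipped at alpha_max
```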


3.1 Re-weighting the voters in the ensemble 


Up to now, we have been assuming that each voter added to the ensemble is new and different from
all the others already present. Even though this should generally be the case when the class
$\mathcal{H}$ is infinitely large, the risk bound of Theorem 1 tells us that it might be
advantageous to occasionally re-weight the voters already present in the ensemble. Indeed, doing so
leaves $\dim(\alpha)$ unchanged and only changes $\|\alpha\|_p$, which can decrease the risk bound.
Hence, let us investigate the update rules for the voters' weights in that case.


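Let us consider that we have $n$ voters currently in the ensemble and we want to change the weight
$\alpha_j$ of voter $h_j$. Let us denote by $\alpha'$ the weight vector formed by the weights of all
voters in the ensemble except voter $h_j$. The quadratic empirical risk can now be written as

$$\begin{aligned}
\frac{1}{m} \sum_{k=1}^{m} \big(y_k - \alpha \cdot \mathbf{h}(x_k)\big)^2
&= \frac{1}{m} \sum_{k=1}^{m} \big(y_k - \alpha' \cdot \mathbf{h}(x_k) - \alpha_j h_j(x_k)\big)^2 \\
&= c - 2\alpha_j \frac{1}{m} \sum_{k=1}^{m} h_j(x_k) \big(y_k - \alpha' \cdot \mathbf{h}(x_k)\big) + \alpha_j^2 \frac{1}{m} \sum_{k=1}^{m} h_j^2(x_k) \\
&= c - 2\alpha_j (\mu_j - M_j) + \alpha_j^2 \eta_j, \qquad (10)
\end{aligned}$$

where $c$ is a term that does not depend on $\alpha_j$, $\mu_j$ and $\eta_j$ retain the same
definitions as before, and $M_j$ is now defined as the correlation of voter $h_j$ with
$\alpha' \cdot \mathbf{h}$.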


Vanilla QuadBoost: In this case, we set $\alpha_j$ so as to minimize only the quadratic risk above.
This gives the same update rule as before; namely, $\alpha_j = (\mu_j - M_j)/\eta_j$. Since this
could increase $\|\alpha\|_p$, Theorem 1 tells us that we should eventually stop updating the
weights to avoid overfitting.


QuadBoost-$L_p$: For these cases, adding the $\alpha_j$-dependent regularization term gives the same
update rules as before except that $M_j$ is now defined as the correlation of voter $h_j$ with
$\alpha' \cdot \mathbf{h}$. For the $L_1$ case, this means that a weight could eventually be set to
zero, thus yielding a sparser solution.
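One sweep of this re-weighting procedure can be sketched as a coordinate-wise update over the voters already in the ensemble. The code below is an illustration under assumptions: the names `quad_risk` and `reweight_pass` and the toy data are hypothetical, the vanilla rule is used when `lam` is 0, and the $L_1$ soft-threshold rule is used otherwise.

```python
def quad_risk(voters, y, alpha):
    """Empirical quadratic risk of the weighted vote alpha . h on the sample."""
    m = len(y)
    return sum((y[k] - sum(a * h[k] for a, h in zip(alpha, voters))) ** 2
               for k in range(m)) / m


def reweight_pass(voters, y, alpha, lam=0.0):
    """One sweep that re-weights every voter of the ensemble in turn.

    voters[j][k] is h_j(x_k) and alpha[j] is its current weight.
    """
    m = len(y)
    alpha = list(alpha)
    for j, h in enumerate(voters):
        # outputs of the ensemble without voter j (the vote with weights alpha')
        rest = [sum(alpha[i] * voters[i][k] for i in range(len(voters)) if i != j)
                for k in range(m)]
        mu = sum(yk * hk for yk, hk in zip(y, h)) / m
        M = sum(hk * rk for hk, rk in zip(h, rest)) / m
        eta = sum(hk * hk for hk in h) / m
        gap = mu - M
        if abs(gap) <= lam:
            alpha[j] = 0.0                     # voter is effectively removed
        else:
            alpha[j] = (gap - lam if gap > 0 else gap + lam) / eta
    return alpha


y = [1, 1, 1, -1, -1, -1]
voters = [
    [1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1],
    [-1, 1, 1, 1, -1, -1],
]
start = [0.5, 0.5, 0.5]
updated = reweight_pass(voters, y, start)
print(quad_risk(voters, y, start), quad_risk(voters, y, updated))
```

Each coordinate update minimizes the quadratic risk exactly in that coordinate, so with `lam = 0` a sweep cannot increase the empirical risk.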


4 Experimental Results 

Let us first note that, although all the boosting algorithms tested in this section can select the voters 
from a continuous set, we have, for the sake of comparison, used only finite sets. We feel that 
continuous sets of voters raise several important issues, including a nontrivial trade-off between
precision, time complexity, and capacity (or Rademacher complexity), which clearly need extra 
work to be lucidly addressed and, consequently, should be treated in a separate paper. 

We now report empirical experiments on binary classification datasets from the UCI Machine Learning
Repository [13]. Each dataset was randomly split into a training set $S$ of at least half of the
examples and at most 500 examples, and a testing set containing the remaining examples. Each
dataset has been normalized using a hyperbolic tangent, whose parameters have been chosen using
the training set $S$ only. We considered a finite set of decision stumps (one-level decision trees),
each of which consists of a single input attribute and a threshold. For each dataset, we generated
10 decision stumps per attribute and their complements.
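A finite symmetric set of stump voters of this kind can be generated as in the following sketch. The placement of the thresholds (evenly spaced between the minimum and maximum of each attribute on the training set) is an assumption made for illustration, since the text does not specify it; the names `make_stumps` and `stump_output` are hypothetical.

```python
def make_stumps(X, per_attribute=10):
    """Build threshold stumps and their complements for each attribute.

    Thresholds are spread evenly between each attribute's minimum and maximum
    on the training set (an assumption, not taken from the paper). Each stump
    is a tuple (attribute index, threshold, sign).
    """
    stumps = []
    for a in range(len(X[0])):
        column = [row[a] for row in X]
        lo, hi = min(column), max(column)
        for t in range(1, per_attribute + 1):
            threshold = lo + (hi - lo) * t / (per_attribute + 1)
            stumps.append((a, threshold, +1))
            stumps.append((a, threshold, -1))   # complement voter
    return stumps


def stump_output(stump, x):
    """Output sign if x[attribute] > threshold, and -sign otherwise."""
    a, threshold, sign = stump
    return sign if x[a] > threshold else -sign


X_demo = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0]]
stumps = make_stumps(X_demo)
print(len(stumps))
```

Including the complement of every stump makes the voter set symmetric, as assumed by Lemma 1 and Theorem 1.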

We compared QuadBoost ($L_1$, $L_2$, $L_\infty$, and its vanilla version that has no regularizer)
with LPBoost [8] and AdaBoost [2]. For each algorithm, all hyperparameters have been chosen among 10
values on a logarithmic scale, by performing 5-fold cross-validation on the training set $S$, and
the reported values are the risks on the testing set.

For QuadBoost-$L_1$, $\lambda$ was chosen in a range between 10“^ and 10°. For QuadBoost-$L_2$, the
range for $\lambda$ was between 10° and 10°, and the range for the number of iterations $T$ was
between 10° and 10°. For QuadBoost-$L_\infty$, $\alpha_{\max}$ was chosen between 10“'° and 10“°,
and $T$ between 10° and 10°. Vanilla QuadBoost's $T$ was chosen between 10° and 10°, hyperparameter
$C$ of LPBoost was chosen between 10“° and 10°, and finally the number of iterations $T$ of AdaBoost
was chosen between 10^ and 10°.


Dataset                      QuadBoost-L1   QuadBoost-L2   QuadBoost-L∞   QuadBoost      LPBoost        AdaBoost
australian                   0.145          0.145          0.145          0.145          0.145          0.191
balance                      0.035          0.022          0.029          0.026          0.029          0.035
breast                       0.054          0.049          0.046          0.046          0.043          0.046
bupa                         0.279          0.285          0.308          0.297          0.349          0.372
car                          0.164          0.160          0.150          0.153          0.155          0.130
cmc                          0.300          0.292          0.296          0.306          0.311          0.310
credit                       0.133          0.133          0.133          0.133          0.128          0.165
cylinder                     0.285          0.311          0.270          0.278          0.278          0.281
ecoli                        0.065          0.065          0.083          0.089          0.113          0.095
flags                        0.309          0.278          0.309          0.268          0.299          0.247
glass                        0.159          0.140          0.150          0.150          0.290          0.215
heart                        0.200          0.178          0.200          0.215          0.230          0.230
hepatitis                    0.221          0.221          0.221          0.221          0.221          0.182
horse                        0.207          0.163          0.158          0.207          0.207          0.185
ionosphere                   0.091          0.126          0.091          0.120          0.120          0.114
letter_ab                    0.010          0.006          0.006          0.006          0.011          0.010
monks                        0.231          0.231          0.231          0.231          0.231          0.255
optdigits                    0.086          0.077          0.074          0.078          0.096          0.081
pima                         0.258          0.237          0.245          0.268          0.253          0.273
tictactoe                    0.315          0.313          0.313          0.322          0.342          0.317
titanic                      0.228          0.228          0.228          0.228          0.222          0.222
vote                         0.051          0.051          0.051          0.051          0.051          0.051
wine                         0.079          0.056          0.056          0.090          0.067          0.045
yeast                        0.283          0.290          0.296          0.296          0.300          0.300
zoo                          0.040          0.100          0.040          0.040          0.120          0.120
Mean running time (seconds)  1.326          74.040         1.131          0.397          26.930         8.096

Table 1: Testing risks of four versions of QuadBoost, compared with LPBoost and AdaBoost. The
bold value corresponds to the lowest testing risk among all algorithms. The last line reports the
mean running time of the algorithms for all datasets.


Table 1 reports the resulting testing risks and training times. The results show that all four
variants of QuadBoost that we considered are competitive with state-of-the-art boosting algorithms.
When comparing vanilla QuadBoost with AdaBoost, where the only hyperparameter to tune is the number
of iterations, QuadBoost wins or ties on 17 of the 25 datasets and is about 20 times faster. When
comparing QuadBoost-$L_1$ with LPBoost, which is also an $L_1$-norm regularized algorithm, QuadBoost
outperforms or ties with LPBoost on 16 of the 25 datasets and is also about 20 times faster.
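These counts can be recomputed directly from Table 1. The following sketch copies the testing risks of the four algorithms involved in the two comparisons and counts the datasets on which the first algorithm has a testing risk no larger than the second; the dictionary layout and the helper name `wins_or_ties` are assumptions made for this illustration.

```python
# Testing risks copied from Table 1, in the order
# (QuadBoost-L1, vanilla QuadBoost, LPBoost, AdaBoost).
risks = {
    "australian": (0.145, 0.145, 0.145, 0.191),
    "balance": (0.035, 0.026, 0.029, 0.035),
    "breast": (0.054, 0.046, 0.043, 0.046),
    "bupa": (0.279, 0.297, 0.349, 0.372),
    "car": (0.164, 0.153, 0.155, 0.130),
    "cmc": (0.300, 0.306, 0.311, 0.310),
    "credit": (0.133, 0.133, 0.128, 0.165),
    "cylinder": (0.285, 0.278, 0.278, 0.281),
    "ecoli": (0.065, 0.089, 0.113, 0.095),
    "flags": (0.309, 0.268, 0.299, 0.247),
    "glass": (0.159, 0.150, 0.290, 0.215),
    "heart": (0.200, 0.215, 0.230, 0.230),
    "hepatitis": (0.221, 0.221, 0.221, 0.182),
    "horse": (0.207, 0.207, 0.207, 0.185),
    "ionosphere": (0.091, 0.120, 0.120, 0.114),
    "letter_ab": (0.010, 0.006, 0.011, 0.010),
    "monks": (0.231, 0.231, 0.231, 0.255),
    "optdigits": (0.086, 0.078, 0.096, 0.081),
    "pima": (0.258, 0.268, 0.253, 0.273),
    "tictactoe": (0.315, 0.322, 0.342, 0.317),
    "titanic": (0.228, 0.228, 0.222, 0.222),
    "vote": (0.051, 0.051, 0.051, 0.051),
    "wine": (0.079, 0.090, 0.067, 0.045),
    "yeast": (0.283, 0.296, 0.300, 0.300),
    "zoo": (0.040, 0.040, 0.120, 0.120),
}


def wins_or_ties(first, second):
    """Number of datasets where column `first` has risk <= column `second`."""
    return sum(1 for r in risks.values() if r[first] <= r[second])


print(wins_or_ties(1, 3))        # vanilla QuadBoost vs AdaBoost
print(wins_or_ties(0, 2))        # QuadBoost-L1 vs LPBoost
print(8.096 / 0.397, 26.930 / 1.326)   # ratios of mean running times
```

The two counts come out to 17 and 16, and both running-time ratios are slightly above 20, in agreement with the figures quoted in the text.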

Table 2 shows a statistical comparison between all these algorithms, using the pairwise Poisson
binomial test of Lacoste et al. [14]. Given a set of datasets, this test gives the probability that a
learning algorithm is better than another one. This table also shows the pairs of algorithms having
a significant performance difference using the pairwise sign test [15]. The only significant values
indicate that QuadBoost with $L_\infty$-norm and $L_2$-norm regularization outperform QuadBoost
without regularization, and that QuadBoost with $L_2$-norm regularization outperforms QuadBoost with
$L_1$-norm regularization. Note however that for the two former algorithms, we have performed
cross-validation over two hyperparameters, which gave them an advantage. Another possible
explanation for the observed improved performance of the $L_2$ and $L_\infty$ regularized versions
is the increased Rademacher complexity of $L_2$ and $L_\infty$ combinations over $L_1$ combinations.

                QuadB.-L∞   QuadB.-L2   LPBoost     AdaBoost    QuadB.      QuadB.-L1
QuadBoost-L∞    0.50        0.48        0.49        0.55        0.80*       0.81*
QuadBoost-L2    0.52        0.50        0.44        0.45        0.75        0.84*
LPBoost         0.51        0.56        0.50        0.42        0.69        0.67
AdaBoost        0.45        0.55        0.58        0.50        0.58        0.57
QuadBoost       0.20*       0.25        0.31        0.42        0.50        0.48
QuadBoost-L1    0.19*       0.16*       0.33        0.43        0.52        0.50

Table 2: Pairwise Poisson binomial test between all pairs of algorithms. A gray value indicates
redundant information, and a star indicates that the difference between the two algorithms is also
significant using the pairwise sign test, with a p-value of 0.05.


In conclusion, the empirical experiments show that QuadBoost is a fast and accurate ensemble
method that competes well against other state-of-the-art boosting algorithms.

5 Conclusion 

We have presented a uniform risk bound for ensembles which holds for any surrogate loss and
$L_p$ norm of the weighted combination of voters, which can be selected from a continuous set. An
important feature of this result is the fact that weighted combinations of unit $L_p$ norm for
$p > 1$ have strictly larger Rademacher complexity than weighted combinations of unit $L_1$ norm
and, as a consequence, the risk bound exhibits an explicit dependence on the number of voters when
$p > 1$ while no such dependence occurs when $p = 1$. This result suggests exercising explicit
control of the number of voters when regularizing with the $L_p$ norm for $p > 1$, while no such
control is needed for $p = 1$. Finally, our theoretical and empirical results suggest that the
simple quadratic loss surrogate should be used for boosting instead of the usual exponential loss.